Abstract

Event  summarization  is  an  effective  process  that  mines  and 
organizes  event  patterns  to  represent  the  original  events.  It 
allows  the  analysts  to  quickly  gain  the  general  idea  of  the 
events.  In  recent  years,  several  event  summarization  algo¬ 
rithms  have  been  proposed,  but  they  all  focus  on  how  to  find 
out  the  optimal  summarization  results,  and  are  designed  for 
one-time  analysis.  As  event  summarization  is  a  comprehen¬ 
sive  analysis  work,  merely  handling  this  problem  with  a  sin¬ 
gle  optimal  algorithm  is  not  enough. 

In  the  absence  of  an  integrated  summarization  solution, 
we  propose  an  extensible  framework  -  META  -  to  enable  an¬ 
alysts  to  easily  and  selectively  extract  and  summarize  events 
from  different  views  with  different  resolutions.  In  this  frame¬ 
work,  we  store  the  original  events  in  a  carefully-designed 
data  structure  that  enables  an  efficient  storage  and  multi¬ 
resolution  analysis.  On  top  of  the  data  model,  we  define  a 
summarization  language  that  includes  a  set  of  atomic  opera¬ 
tors  to  manipulate  the  meta-data.  Furthermore,  we  present  5 
commonly  used  summarization  tasks,  and  show  that  all  these 
tasks  can  be  easily  expressed  by  the  language.  Experimental 
evaluation  on  both  real  and  synthetic  datasets  demonstrates 
the  efficiency  and  effectiveness  of  our  framework. 

1  Introduction 

Event  summarization  is  a  process  that  mines  and  organizes 
event  patterns  to  represent  the  original  event  sets,  so  that 
analysts  can  understand  the  system  behaviors.  Different 
from  traditional  frequent  pattern  mining  techniques  that  sim¬ 
ply  discover  patterns,  event  summarization  provides  a  brief 
yet  accurate  summary  for  event  datasets.  These  summaries 
smooth  the  learning  curve  of  understanding  the  system  and 
give  the  analysts  insightful  hints  before  conducting  deep 
analysis  with  advanced  data  mining  techniques. 

Several research efforts have proposed various summarization
methods. Each of them defines its own way of summarizing
events/documents. Other efforts have focused on techniques for
presenting event summarization results. From all these
explorations, we can conclude that event summarization is not a
problem that can be handled by a single model or algorithm. For
different users or for different purposes, there are various ways
of conducting event summarization, and also many parameters to be
set. To obtain an event summary from different perspectives, an
analyst has to re-preprocess the data and change the program time
after time, which drains the analyst's productivity.

The  predicament  is  very  similar  to  that  of  the  time  when 
every  data-intensive  task  has  to  use  a  separate  program  for 
data  manipulation.  The  data  representation  and  query  prob¬ 
lem  were  eventually  addressed  by  the  ER  model  and  SQL. 
Following  the  historical  path  of  DBMS  and  query  languages, 
we  believe  event  summarization  (as  well  as  event  analysis) 
should  also  be  abstracted  to  an  independent  software  system 
with  a  uniform  data  model  and  an  expressive  query  language. 

An event summarization system has to be flexible enough that
real-life scenarios can be adequately and efficiently handled and
supported. The following are some typical scenarios that an event
analyst would encounter.

SCENARIO 1. An analyst obtains a system log of the whole year,
but he only wants to view the summary of the events that were
recorded within the latest 30 days. Moreover, he wants to see the
summary without the trivial event "firewall scan". Also, he wants
to see the summarization with the hourly granularity.

SCENARIO  2.  After  viewing  summarization  results,  the  an¬ 
alyst  suspects  that  one  particular  time  period  of  events  be¬ 
haves  abnormally,  so  he  wants  to  conduct  anomaly  detection 
just  for  that  period  to  find  out  more  details. 

SCENARIO  3.  The  system  has  generated  a  new  set  of  secu¬ 
rity  log  for  the  current  week.  The  analyst  wants  to  merge 
the  new  log  into  the  repository  and  also  to  summarize  the 
merged  log  with  the  daily  granularity. 

To handle the work in the first scenario using existing event
summarization methods, we need to perform the following tasks:
(1) write a program or use an existing program to extract the
events that occurred during the specified time range; (2) write
or leverage an existing program to remove the irrelevant event
types; (3) write or leverage an existing program to aggregate the
events by hour; and (4) feed the pre-processed events to existing
event summarization methods to obtain the summary. About the same
amount of work is needed for the second and third scenarios. Note
that, if parameter tuning is needed, a typical summarization task
requires hundreds of such iterations in the aforementioned
scenarios. Therefore, it is inefficient and tedious.

Similar  to  OLAP  as  an  exploration  process  for  trans¬ 
actional  data,  event  summarization  is  also  a  trial-and-error 
process  for  temporal  event  data.  As  event  summarization 
requires  repetitive  exploration  of  the  events  from  different 
views,  we  believe  it  is  necessary  to  have  an  integrated 
framework  to  enable  users  to  easily,  interactively,  and  se¬ 
lectively  extract,  summarize,  and  analyze  the  temporal  event 
data.  Event  summarization  should  be  the  first  step  of  any 
other  mining  tasks,  and  its  goal  is  to  enable  the  analysts  to 
quickly gain the general idea of the events. Similar to
FIU-Miner [21], which supports rapid task configuration, the event
summarization  framework  should  allow  analysts  to  easily 
compose  various  summarization  and  analysis  tasks,  and  then 
to  efficiently  execute  them. 

To  satisfy  the  above  requirements,  we  propose  an  ex¬ 
tensible  event  summarization  framework  called  META  to  fa¬ 
cilitate  the  multi-resolution  summarization  as  well  as  its  as¬ 
sociated  tasks.  Instead  of  inventing  new  summarization  al¬ 
gorithms,  we  focus  on  filling  the  missing  component  of  the 
event  summarization  task  and  making  it  a  complete  knowl¬ 
edge  discovery  process.  Therefore,  our  work  is  complemen¬ 
tary  and  orthogonal  to  previous  works  that  focus  on  propos¬ 
ing  different  event  summarization  algorithms. 

We  design  META  with  the  following  principles:  1)  the 
framework  should  be  flexible  enough  to  accommodate  many 
real-life  scenarios;  and  2)  the  framework  should  ease  the 
summarization  tasks  implementation  as  much  as  possible. 
Figure 1 shows the corresponding workflows of conducting
the  above  scenarios  with  the  META  framework,  including 
ad-hoc  summarization,  events  storing,  recovering,  updating, 
and  merging.  For  each  scenario,  the  analyst  only  needs  to 
write  and  execute  a  short  piece  of  script. 

Figure 1: Summarization workflows of example scenarios

1.1 Contributions The contributions of this paper are listed as
follows: (1) We present a multi-resolution data model called
summarization forest to efficiently store the event sequences as
well as the necessary meta-data. Summarization forest is designed
to store and represent the event sequence in multi-resolution
views with specified precision. (2) We define a summarization
language which includes a set of basic operations for expressing
summarization tasks. Each basic operation is an atomic operation
that directly operates on the data. (3) We introduce five
commonly used event summarization tasks, including ad-hoc
summarization, event storing, recovering, updating, and merging.
We also show that these tasks can be expressed by the language.
(4) We conduct a series of experiments on both real and synthetic
event sequences to demonstrate the effectiveness, convenience,
and efficiency of our proposed framework.

2  Related  Works 

Several  works  focusing  on  leveraging  data  mining  and  data 
processing  techniques  for  event  analysis  have  been  proposed 
in  recent  years.  According  to  the  functionalities,  they  can  be 
categorized  as:  event  log  pre-processing,  event  summariza¬ 
tion,  and  event  based  system  analysis. 

In general, the event logs obtained from systems are
unstructured or semi-structured and are not immediately available
for analysis. Researchers have proposed event format standards
such as Common Event Expression [3] and Event Relationship
Network [13] to describe all the event logs. Unfortunately, these
representations are not widely adopted. Converting the raw logs
into a canonical readable format therefore requires considerable
pre-processing effort. Existing pre-processing methods utilize
various techniques such as source code parsing, clustering, and
substring matching to extract the templates from the raw event
messages and then transform them into structured formats.

Event summarization focuses on extracting the high-level overview
from an event log. Before diving into the details, it is a good
choice for the analysts to see the summary first. Peng et al.
proposed an approach to find the dependency among events by
measuring the inter-arrival distribution of the events. Kiernan
et al. summarized the events by segmenting the event sequence
according to the frequency changes. Wang et al. further extended
Kiernan's work by presenting the inter-segment relationship with
an HMM. Jiang et al. [4] provided a richer summarization of the
events by providing the event relationship network (ERN) based on
the logs, which captures the temporal dynamics. To the best of
our knowledge, existing research mainly focused on developing
approaches to find the optimal summarization results, while our
work presents a comprehensive and extensible framework to
facilitate the multi-resolution summarization for the system
analysts.

Event log based system analysis focuses on revealing the hidden
problems of the systems. Different analysis tasks pay attention
to different application aspects, such as system failure tracing,
event correlation discovery [20], and event based trend analysis.
In practice, these methods are often conducted when the analysts
already have some prior knowledge about the data.

3  The  Multi-Resolution  Data  Model 

An event sequence can be represented in the form of an event
record sequence $D = (\langle t_1, e_1 \rangle, \langle t_2, e_2
\rangle, \ldots, \langle t_n, e_n \rangle)$, where $t_i$ is the
time when an event occurs and $e_i$ denotes the event instance
with an associated 'type'. Each event instance belongs to one of
the $m$ types $\mathcal{E} = \{e_1, \ldots, e_m\}$. Note that the
'type' is a generic terminology. Any combination of the features
of an event can be used as the 'type', e.g. the event category
and event name in combination can be used as the event type. In
this section, we first describe how to use an event vector, an
intermediate data structure, to represent the event sequence.
Then we introduce the summarization forest (SF), the data model
to store event sequences with multiple resolutions.

3.1 Vector Representation of Event Occurrences Given an event
sequence $D$ with $m$ event types and time range $[t_s, t_e]$, we
decompose $D$ into $m$ subsequences $D = (D_{e_1}, \ldots,
D_{e_m})$, each containing the instances of one event type.
Afterwards, we convert each $D_i$ into an event vector $V_i$,
where the indexes indicate the time and the values indicate the
number of event occurrences. During conversion, we constrain the
length of each vector to be $2^l$, $l \in \mathbb{Z}^+$, where
$l$ is the smallest value that satisfies $2^l \ge t_e - t_s$. In
the vector, the first $t_e - t_s$ entries record the actual
occurrences of the event instances, and the remaining entries are
filled with 0's. Example 1 provides a simple illustration of how
we convert the event sequence.

EXAMPLE 1. The left part of Figure 2 gives an event sequence
containing 3 event types within time range $[t_1, t_{12}]$. The
right part shows the conversion result of the given event
sequence. Note that the original event sequence is decomposed
into 3 subsequences. Each subsequence, representing one event
type, is converted to a vector with length 16. The numbers in
bold indicate the actual occurrences of the events, and the
remaining entries are filled with 0's.
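
The conversion in Example 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `vectorize` and its parameters are hypothetical, the time range is assumed to be inclusive with one slot per time unit, and the occurrence times of event type A are read off its vector in Figure 2.

```python
def vectorize(times, t_start, t_end):
    """Convert the occurrence times of ONE event type into a zero-padded count vector."""
    # Assumption: the range [t_start, t_end] is inclusive, one slot per time unit.
    n_slots = t_end - t_start + 1
    length = 1
    while length < n_slots:            # smallest power of two that holds all slots
        length *= 2
    vector = [0] * length
    for t in times:
        if not t_start <= t <= t_end:
            raise ValueError(f"time {t} is outside [{t_start}, {t_end}]")
        vector[t - t_start] += 1       # count the occurrences that fall into each slot
    return vector


# Event type A occurs at t = 1, 2, 3, 7, 8, 9 within the range [1, 12].
vec_a = vectorize([1, 2, 3, 7, 8, 9], 1, 12)
print("".join(map(str, vec_a)))
```

The 12 slots are padded to the next power of two, 16, which is what later allows the vector to be halved repeatedly.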


Figure 2: Convert the original event sequence to the vectors
(A: 1110001110000000, B: 0000001000010000, C: 0011100111000000)

Vectors intuitively describe the occurrences of events, but this
kind of representation is neither storage efficient (as it
requires $O(|\mathcal{E}| n)$ space) nor analysis efficient (as
it does not support multi-resolution analysis). To facilitate the
storage and analysis, we propose the summarization tree to model
the event occurrences of a single type. Furthermore, we propose
the summarization forest to model the event occurrences of the
whole event log.

3.2 Summarization Tree The summarization tree is used to store
the event occurrences for a single event type. It is capable of
providing both frequency and locality of occurrences
simultaneously. Moreover, it satisfies the multi-resolution
analysis (MRA) requirements by representing the event occurrences
with various subspaces. This property enables the analysts to
choose a proper subspace to view the data at a corresponding
granularity. The summarization tree is formally defined below.

Figure 3: Relationship between vector and ST (the summary node at
the top, the description nodes at granularities 4 down to 1 below
it, and the vector along the time axis at the bottom)

Definition 3.1. A summarization tree (ST) is a balanced tree
where all nodes store the temporal information about the
occurrences of events. The tree has the following properties:

1. Each summarization tree has two types of nodes: a summary node
and description nodes.

2. The root is a summary node, and it has only one child. The
root stores the total occurrences of the events throughout the
event sequence.

3. All the other nodes are description nodes. They either have
two children or no child. These nodes store the frequency
difference between adjacent chunks (the frequency of the first
chunk subtracted by that of its following chunk) of the sequence
described by lower level nodes.

4. The height of the summarization tree is the number of levels
of description nodes. The height of a node in the tree is counted
from bottom to top, starting from 0. The nodes at height $i$
store the frequency differences that can be used to obtain the
temporal information of granularity $i$.

Considering event type A in Example 1, Figure 3 shows its vector
and the corresponding summarization tree. As illustrated in the
figure, the summarization tree stores the sum of the occurrence
frequencies (6 occurrences) at the root node, and the frequency
differences (within the dashed box) in the description nodes at
various granularities. Note that at the same level of the tree,
the description nodes store the differences between adjacent
sequence chunks at the same granularity. The larger the depth,
the more detailed the differences they store. For example, at
granularity 1, every two adjacent time slots in the original
event sequence are grouped into one chunk, and the grouped event
sequence is '21021000'. Correspondingly, in the summarization
tree, the frequency differences of each pair of adjacent time
slots $(0, -1, 0, 0, -1, 0, 0, 0)$ are recorded at the leaf
level. Similarly, the frequency differences at various
granularities are recorded in the description nodes at the
corresponding levels.
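
The construction described above can be sketched as a repeated pairwise sum-and-difference pass over the vector. The code below is a minimal sketch under stated assumptions, not the authors' implementation: the name `encode` and the `(root, levels)` return layout are hypothetical, and each difference is computed as the second chunk minus the first, which is the sign convention that reproduces the leaf values listed in the text.

```python
def encode(vector):
    """Encode a count vector as (root, levels).

    levels[0] holds the leaf-level differences and levels[-1] the coarsest one;
    root is the total number of occurrences (the summary node).
    """
    n = len(vector)
    if n < 2 or n & (n - 1):
        raise ValueError("vector length must be a power of two, at least 2")
    sums, levels = list(vector), []
    while len(sums) > 1:
        pairs = list(zip(sums[0::2], sums[1::2]))
        # Assumed sign convention: following chunk minus first chunk.
        levels.append([right - left for left, right in pairs])
        # The chunk sums form the sequence seen at the next coarser granularity.
        sums = [left + right for left, right in pairs]
    return sums[0], levels


# Vector of event type A from Example 1.
root, levels = encode([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
for height, level in enumerate(levels):
    print(height, level)
print("root", root)
```

Counting the summary node, the tree for this 16-slot vector has 1 + 1 + 2 + 4 + 8 = 16 nodes, the same as the vector length.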

It is clear that the space complexity of the summarization tree
is $O(|T|)$, where $|T| = n$ and $n$ is the length of the vector.
From the storage perspective, directly storing the tree brings no
space saving. There are two ways to reduce the space complexity
of the summarization tree: detail pruning and sparsity storage.

3.2.1 Detail Pruning In practice, analysts only care about the
high-level overview of the event occurrences. Consequently, there
is no need to store all the details of the event sequences. As
the summarization tree describes the event occurrences in a
top-down manner (a coarse-to-fine strategy), we can save storage
by removing the lower levels of the description nodes. The pruned
tree still contains enough details for analysis: an analyst who
analyzes a long-term event log would not care about the event
occurrences at the precision of seconds. Due to the hierarchical
structure of the tree, we can reduce the storage space
exponentially. Lemma 3.1 shows how much space can be saved
through pruning. For example, suppose the original tree has 14
levels and 8192 (or $2^{13}$) nodes. If we prune the tree by
removing the last 6 levels, the size of the tree becomes
$\frac{1}{64}|T| = 128$, which is only about 1.5% of the original
size. The pruned tree is still able to describe the event
occurrences with 1-minute granularity.

LEMMA 3.1. Suppose the height of the summarization tree is $H$.
If we only keep the nodes with a height larger than or equal to
$k$, the size of the pruned tree is $2^{H-k}$.

Proof. According to property 3 of the definition of the
summarization tree, apart from the summary node, the
summarization tree is a perfect binary tree. If only the nodes
with height larger than or equal to $k$ are kept, the number of
remaining nodes in the perfect binary tree part is
$\sum_{i=k}^{H-1} 2^{H-1-i} = 2^{H-k} - 1$. Therefore, the total
size of the summarization tree after pruning is $2^{H-k}$.
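
The count in Lemma 3.1 can be checked numerically. The sketch below is illustrative only; the helper names are hypothetical, and it reads the example in Section 3.2.1 as a tree with H = 13 levels of description nodes plus the summary node.

```python
def level_sizes(height):
    """Number of description nodes at each height 0..height-1 (leaf level first)."""
    return [2 ** (height - 1 - h) for h in range(height)]


def pruned_size(height, k):
    """Nodes left when only heights >= k are kept, plus the single summary node."""
    return sum(level_sizes(height)[k:]) + 1


H, k = 13, 6                       # 8192-slot vector with the 6 lowest levels removed
full = pruned_size(H, 0)
kept = pruned_size(H, k)
print(full, kept, f"{kept / full:.2%}")
```

For every k between 0 and H the count equals 2^(H-k), so each removed level halves the stored size.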

3.2.2 Sparsity Storage Another way to reduce the space is to
store only the non-zero nodes of the tree. The majority of the
event types rarely appear in the event sequence. In this case,
the corresponding vector will be dominated by 0's. Accordingly,
the transformed summarization tree will also contain many 0's.
For example, event type X only occurs twice throughout a 2-hour
(7200 second) event sequence. The first occurrence is the first
second, and the second occurrence is the second second. The
number of nodes in the corresponding summarization tree is 8192,
but there are only 28 non-zero nodes. Lemma 3.2 provides a lower
bound on how many zero nodes exist in a summarization tree.

LEMMA 3.2. Suppose the occurrence proportion (the probability of
occurrences at any time) of event type X is $r = \#X / n$, where
$\#X$ is the number of occurrences of X and $n$ is the length of
the vector that stores the event occurrences. For the
corresponding summarization tree, the proportion of zero nodes at
height $h$ is $\max(1 - 2^{h+1} r, 0)$.

Proof. The proof can be found in the Appendix.

Based on Lemma 3.1 and Lemma 3.2, we further show the space
complexity of a summarization tree in Theorem 3.1.

THEOREM 3.1. The space complexity of a summarization tree with
granularity $k$ is
$O\left(\frac{|T|}{2^k} - \sum_{h=k}^{H} \max\left(\frac{|T|}{2^h} - 2^{h+1} r, 0\right)\right)$,
where $|T|$ is the length of the vector, $H$ is the height of the
summarization tree, and $r$ is the occurrence proportion as
described in Lemma 3.2.

Proof. The proof is based on Lemma 3.1 and Lemma 3.2. The number
of nodes with height (granularity) larger than or equal to $k$ is
$2^{H-k} = \frac{|T|}{2^k}$ according to Lemma 3.1. For each
level $h \ge k$, the number of zero nodes is
$\max\left(\frac{|T|}{2^h} - 2^{h+1} r, 0\right)$, and the number
of zero nodes over all levels with height larger than or equal to
$k$ is $\sum_{h=k}^{H} \max\left(\frac{|T|}{2^h} - 2^{h+1} r, 0\right)$.
Therefore, the number of non-zero nodes in the summarization tree
is
$\frac{|T|}{2^k} - \sum_{h=k}^{H} \max\left(\frac{|T|}{2^h} - 2^{h+1} r, 0\right)$.

It is true that the second term will become 0 when $r$ is
sufficiently large. However, based on the empirical study, most
of the event types occur rarely, and therefore $0 < r \ll 1$.
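
Sparsity storage can be sketched as a mapping that keeps only the non-zero nodes. The code below is a hypothetical illustration rather than the authors' storage format: the names `to_sparse` and `from_sparse` and the `(height, index)` keys are assumptions, and the literal tree is the one of event type A from Figure 3, with the summary node stored one level above the top description node.

```python
def to_sparse(root, levels):
    """Keep only non-zero nodes, keyed by (height, index within the level)."""
    nodes = {}
    for height, level in enumerate(levels):
        for index, value in enumerate(level):
            if value != 0:
                nodes[(height, index)] = value
    if root != 0:
        nodes[(len(levels), 0)] = root      # summary node sits above all levels
    return nodes


def from_sparse(nodes, n_levels):
    """Rebuild (root, levels); missing keys are treated as zero nodes."""
    levels = [
        [nodes.get((h, i), 0) for i in range(2 ** (n_levels - 1 - h))]
        for h in range(n_levels)
    ]
    return nodes.get((n_levels, 0), 0), levels


# Summarization tree of event type A (root 6, leaf level first).
tree_a = (6, [[0, -1, 0, 0, -1, 0, 0, 0], [-1, 2, -1, 0], [-1, -1], [-4]])
sparse_a = to_sparse(*tree_a)
print(len(sparse_a), "of 16 nodes stored")
```

Because the full size of the tree is known from the number of levels, the zero nodes do not need to be stored to rebuild it.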

3.3 Summarization Forest The summarization forest is a data model
which contains all the summarization trees. In one forest, there
are $|\mathcal{E}|$ summarization trees, each of which stores the
events of one event type. Besides the trees, the summarization
forest also stores the necessary meta-data. The summarization
forest is formally defined in Definition 3.2.

Definition 3.2. A summarization forest (SF) is a 6-tuple
$\mathcal{F} = \langle \mathcal{E}, \mathcal{T}, t_s, t_e, l, r \rangle$,
where:

1. $\mathcal{E}$ denotes the set of the event types in the event
sequence.

2. $\mathcal{T}$ denotes the set of summarization trees.

3. $t_s$ and $t_e$ denote the start timestamp and end timestamp
of the event sequence represented by $\mathcal{F}$.

4. $l$ denotes the full size of each ST, including the zero and
non-zero nodes. All the trees have the same full size.

5. $r$ denotes the resolution of each ST. All the trees are in
the same resolution.

Note that since the summarization trees are stored in sparse
form, the actual number of nodes that are stored for each tree
can be different and should be much less than the full size.
Given a summarization forest, we can recover the original event
sequences.
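
The 6-tuple of Definition 3.2 maps naturally onto a small record type. The following is a minimal sketch, assuming a Python dataclass; the class and field names are hypothetical, and the trees are assumed to be held as one sparse node mapping per event type.

```python
from dataclasses import dataclass


@dataclass
class SummarizationForest:
    event_types: set      # E: event types in the event sequence
    trees: dict           # T: event type -> sparse {(height, index): value} mapping
    t_start: int          # t_s: start timestamp
    t_end: int            # t_e: end timestamp
    full_size: int        # l: full size of every tree (zero and non-zero nodes)
    resolution: int       # r: resolution shared by all trees

    def __post_init__(self):
        # Every declared event type needs a tree (possibly an empty one).
        missing = self.event_types - set(self.trees)
        if missing:
            raise ValueError(f"no summarization tree for event types: {sorted(missing)}")
        if self.t_end < self.t_start:
            raise ValueError("t_end must not precede t_start")


forest = SummarizationForest(
    event_types={"A", "B"},
    trees={"A": {(4, 0): 6, (3, 0): -4}, "B": {(4, 0): 2}},
    t_start=1,
    t_end=12,
    full_size=16,
    resolution=0,
)
print(sorted(forest.trees))
```

Keeping the full size and the resolution at the forest level reflects the definition's requirement that all trees share them.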

4  Basic  Operations 

In  this  section,  we  propose  a  set  of  basic  operators  which  are 
built  on  top  of  the  data  model  we  proposed.  These  operators 
form  the  summarization  language,  which  is  the  foundation  of 
the  event  summarization  tasks  presented  in  our  framework. 
The  motivation  of  proposing  a  summarization  language  is 
to  make  the  event  summarization  flexible  and  allow  the 
advanced  analysts  to  define  the  ad-hoc  summarization  tasks 
to  meet  the  potential  new  needs. 

The basic operators are categorized into two families: the data
transformation operators and the data query operators. The
operators of the first family focus on transforming data from one
type to another, and they are not directly used for summarization
work. The operators of the second family focus on
retrieving/manipulating data in a read-only way, and they provide
the flexibility of generating the summarization. To make the
notations easy to follow, we list the symbols of all these
operations in Table 1. We will introduce their meanings later in
this section.

Operation     Symbol                                         Description
Vectorize     $\circ(D_i)$                                   Vectorize the subsequence $D_i$.
Unvectorize   $\bullet(V_i)$                                 Unvectorize the vector $V_i$.
Encode        $\triangleleft(V_i)$                           Encode $V_i$ into a summarization tree $T_i$.
Decode        $\triangleright(T_i)$                          Decode $T_i$ back to vector $V_i$.
Prune         $\ominus(T_i)$                                 Prune the most detailed information of $T_i$.
Concatenate   $\mathcal{F}_1 \uplus \mathcal{F}_2$           Concatenate two SFs $\mathcal{F}_1$ and $\mathcal{F}_2$.
Project       $\Pi_{e_{(1)},\ldots,e_{(k)}}(\mathcal{F})$    Extract events of types $e_{(1)},\ldots,e_{(k)}$ from $\mathcal{F}$.
Select        $\sigma_{[t_1,t_2]}(\mathcal{F})$              Pick the events occurring within time $[t_1, t_2]$.
Zoom          $\tau_u(\mathcal{F})$                          Aggregate the events with granularity $u$.
Describe      $\Upsilon_{name}(\mathcal{F})$                 Use algorithm name for event summarization.

Table 1: Notations of basic operations

4.1 Data Transformation Operators The data transformation
operators include vectorize, unvectorize, encode, decode, prune,
and concatenate. Their functionalities are listed as follows:

Vectorize and Unvectorize: Vectorize is used to convert the
single event type subsequence $D_i$ into a vector $V_i$, while
unvectorize does the reverse work. Both of them are unary, and
they are represented by the symbols $\circ$ and $\bullet$,
respectively. Semantically, these two operators are complementary
operators, i.e. $D_i = \bullet(\circ(D_i))$ and
$V_i = \circ(\bullet(V_i))$.

Encode and Decode: Encode is used to convert the vector $V_i$
into a summarization tree $T_i$, while decode does the reverse
work. Similar to vectorize/unvectorize, encode and decode are
complementary operators and both of them are unary. We use the
symbols $\triangleleft$ and $\triangleright$ to denote them,
respectively.

Prune: The operator prune is unary, and it operates on the
summarization tree. It is used to remove the most detailed
information of the events by pruning the leaves of a
summarization tree. Note that this operator is irreversible. Once
it is used, the target summarization tree will permanently lose
the removed level. We use $\ominus$ to denote this operator.

Concatenate: The operator concatenate is a binary operator. It
combines two SFs into a big one and also updates the meta-data.
We use $\uplus$ to denote this operation. Note that only the SFs
with the same resolution can be concatenated.
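
The interplay of decode and prune can be sketched as follows. This is an illustrative sketch under the same assumptions as the tree of event type A in Figure 3 (a tree is a `(root, levels)` pair with the leaf level first, and a difference is the second chunk minus the first); the function names are hypothetical.

```python
def decode(root, levels):
    """Rebuild the count vector, or a coarser one if leaf levels were pruned."""
    sums = [root]
    for diffs in reversed(levels):            # walk from the coarsest level down
        finer = []
        for total, diff in zip(sums, diffs):
            left = (total - diff) // 2        # total = left + right, diff = right - left
            finer.extend([left, left + diff])
        sums = finer
    return sums


def prune(root, levels):
    """Drop the most detailed (leaf) level; this cannot be undone."""
    return root, levels[1:]


tree_a = (6, [[0, -1, 0, 0, -1, 0, 0, 0], [-1, 2, -1, 0], [-1, -1], [-4]])
print(decode(*tree_a))             # the full-resolution vector of event type A
print(decode(*prune(*tree_a)))     # one level coarser: counts per pair of slots
```

Decoding the pruned tree yields the grouped sequence '21021000' mentioned in Section 3.2, which illustrates why pruning loses only the finest level of detail.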

4.2 Data Query Operators The data query operators include select,
project, zoom, and describe. They all take a summarization forest
$\mathcal{F}$ as the input. The data query operators are similar
to the Data Manipulation Language (DML) in SQL, which provides
query flexibility to users.

Their functionalities are listed as follows:

Project: The operator project is similar to the 'projection' in
relational algebra. It is a unary operator written as
$\Pi_{e_{(1)}, e_{(2)}, \ldots, e_{(k)}}(\mathcal{F})$. The
operation is defined as picking the summarization trees whose
event types are in the subset
$\{e_{(1)}, \ldots, e_{(k)}\} \subseteq \mathcal{E}$.

Select: The operator select is similar to the 'selection' in
relational algebra. It is a unary operator written as
$\sigma_{[t_1, t_2]}(\mathcal{F})$.

Zoom: The operator zoom is used to control the resolution of the
data. It is a unary operator written as $\tau_u(\mathcal{F})$,
where $u$ is the assigned resolution: the larger $u$ is, the
coarser the resolution.

Describe: The describe operator indicates which algorithm is used
to summarize the events. Its implementation depends on the
concrete algorithm, and all the previous event summarization
papers can be regarded as proposing a concrete describe operator.
For example, one existing method summarizes the events with
periodic and inter-arrival relationships. The describe operation
is written as $\Upsilon_{name}(\mathcal{F})$, where name is the
name of the summarization algorithm used for describing the
events. If necessary, the analyst can implement her/his own
describe algorithm that follows the specification of our
framework. In our implementation, the time complexity of each of
these operators is lower than
$O(|\mathcal{E}||T| \log |T|) = O(|\mathcal{E}| n \log n)$.
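
The semantics of project, select, and zoom can be illustrated on plain count vectors. This sketch is deliberately simplified and is not how the framework evaluates the operators on summarization trees: a forest is assumed to be a dict from event type to vector, `select` is assumed to take slot indexes with an inclusive range, and `zoom` is assumed to merge 2^u adjacent slots. All names are hypothetical.

```python
def project(forest, types):
    """Keep only the vectors of the requested event types."""
    return {etype: vec for etype, vec in forest.items() if etype in types}


def select(forest, first, last):
    """Keep the slots with index in [first, last] (inclusive) for every type."""
    return {etype: vec[first:last + 1] for etype, vec in forest.items()}


def zoom(forest, u):
    """Aggregate every 2**u adjacent slots into one; larger u means coarser."""
    chunk = 2 ** u
    return {
        etype: [sum(vec[i:i + chunk]) for i in range(0, len(vec), chunk)]
        for etype, vec in forest.items()
    }


forest = {
    "A": [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "B": [0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
}
# Type A only, first eight slots, two slots per bucket.
summary_input = zoom(select(project(forest, {"A"}), 0, 7), 1)
print(summary_input)
```

The result of such a chain would then be handed to whichever describe algorithm the analyst chooses.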

5  Event  Summarization  Tasks 

Considering  the  requirements  of  the  analysts  discussed  in  In¬ 
troduction,  we  introduce  five  commonly  used  event  summa¬ 
rization  tasks:  summarization,  storing,  recovering,  merging, 
and  updating,  using  the  previously  defined  basic  operators  as 
the  building  blocks.  The  intention  here  is  to  demonstrate  the 
expressive  capability  of  the  basic  operators,  instead  of  giving 
a  thorough  coverage  of  all  the  possible  tasks. 

5.1 Summarization Task The summarization task is the core of
event summarization, and all prior works on event summarization
focus on this problem. Based on the defined basic operators,
analysts can summarize the events in a flexible way. In our
framework, any summarization task can be described by the
following expression:

$\Upsilon_{name}(\sigma_{[t_1, t_2]}(\tau_u^{*}(\Pi_{E \in \mathcal{P}(\mathcal{E})}(\mathcal{F}))))$.

The symbol $*$ denotes conducting the operation zero or more
times. With the combination of operators, the analysts are able
to summarize any subset of events in any resolution during any
time range with any summarization algorithm.

Note that the order of the operators can be changed, but the
summarization results of different orders are not guaranteed to
be the same. For example, a commonly used implementation of the
describe operator is based on the minimum description length
principle. Such an implementation aims to find a model that
describes the events with the least information. Therefore, the
results of $\Upsilon_{name}(\tau_u(\mathcal{F}))$ and
$\tau_u(\Upsilon_{name}(\mathcal{F}))$ are possibly different.

5.2 Storing Task Storing is an important task. Converting the raw
event log time after time is time-consuming and makes the logs
hard to manage. This task enables the analysts to convert the
events into a uniform data model only once and reuse it
afterwards. The store task can be written as:

$\mathcal{F} = \bigcup_{e_i \in \mathcal{E}_i} \ominus^{*}(\triangleleft(\circ(D_i)))$,

where $\mathcal{E}_i$ denotes the set of event types that the
analysts are interested in, and $\bigcup$ denotes putting all the
trees together to form the SF. The analysts are able to pick any
time resolution and any subset of all the event types for
storage.

5.3 Recovering Task The recovering task is the link between
event summarization and other data mining tasks. After
finding an interesting piece of the event log via the
summarization results, the analysts should be able to transform the
selected portion of the SF back to its original events, so that they
can use other data mining techniques for further analysis. The
recover task can be expressed as:

$\mathrm{unvectorize}(\mathrm{decode}(\sigma_{[t_1,t_2]}(\tau_u(\Pi_{E \in \mathcal{P}(\mathcal{E})}(\mathcal{F}))))).$

This expression shows that the analysts can selectively
recover the events of any subset of event types, in
any time range and at any time resolution.
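
As a rough illustration of the recover task, the Python sketch below works on a hypothetical flat SF (one count vector per event type), so the decode step is trivial and omitted. For simpler indexing it applies select before zoom; the function name `recover` and its parameters are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple

Forest = Dict[str, List[int]]  # hypothetical flat stand-in for an SF


def recover(sf: Forest, types: Set[str], t1: int, t2: int,
            zoom_levels: int = 0) -> Dict[str, List[Tuple[int, int]]]:
    """Return (first original slot, count) records for the chosen slice."""
    events = {}
    for e, vector in sf.items():
        if e not in types:                      # project
            continue
        window = vector[t1:t2 + 1]              # select, t1..t2 inclusive
        width = 1
        for _ in range(zoom_levels):            # zoom out, one level at a time
            window = [sum(window[i:i + 2]) for i in range(0, len(window), 2)]
            width *= 2
        # unvectorize: keep only the slots in which the event occurs
        events[e] = [(t1 + i * width, c) for i, c in enumerate(window) if c > 0]
    return events


sf = {"a": [1, 0, 0, 2], "b": [0, 1, 0, 0]}
fine = recover(sf, {"a"}, 2, 3)
coarse = recover(sf, {"a", "b"}, 0, 3, zoom_levels=1)
```

The recovered records are plain (time, count) pairs, so they can be handed to any other mining tool without knowledge of the SF.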

5.4 Merging and Updating Tasks Both the merging and
updating tasks focus on the maintenance of the stored SF, but their
motivations are different.

The merging task is conducted when the analysts obtain
SFs with disjoint time periods and want to archive them
together. Suppose $\mathcal{F}_1$ and $\mathcal{F}_2$ denote two SFs, where $\mathcal{F}_2$
contains more details (i.e., it contains a lower resolution level). The
merging task can be expressed as:

$\mathcal{F}_{new} = \mathrm{concatenate}(\mathcal{F}_1, \mathrm{prune}^{*}(\mathcal{F}_2)).$

As shown in the above expression, when we merge two
SFs with different resolutions, the SF with
the finer granularity is pruned to match the SF with
the coarser granularity. Then the two SFs are merged with
the concatenate operation.
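
The prune-then-concatenate idea can be sketched in Python as below. This is only an illustration: the SF is modeled as a hypothetical pair of a resolution level and flat count vectors, both SFs are assumed to track the same event types, and `prune_to` and `merge` are illustrative names.

```python
from typing import Dict, List, Tuple

# Hypothetical flat SF: (resolution level, {event type: count vector}).
# A larger level means coarser slots; each level doubles the slot width.
FlatSF = Tuple[int, Dict[str, List[int]]]


def prune_to(sf: FlatSF, level: int) -> FlatSF:
    """Coarsen an SF to the given level by repeatedly summing slot pairs."""
    current, trees = sf
    while current < level:
        trees = {e: [sum(v[i:i + 2]) for i in range(0, len(v), 2)]
                 for e, v in trees.items()}
        current += 1
    return current, trees


def merge(sf1: FlatSF, sf2: FlatSF) -> FlatSF:
    """Prune the finer SF to the coarser level, then concatenate in time."""
    level = max(sf1[0], sf2[0])
    (_, left), (_, right) = prune_to(sf1, level), prune_to(sf2, level)
    # Assumption: both SFs track the same event types.
    return level, {e: left[e] + right[e] for e in left}


older = (1, {"a": [1, 2]})          # coarser, earlier period
newer = (0, {"a": [1, 0, 0, 1]})    # finer, later period
merged = merge(older, newer)
```

Pruning the finer SF loses detail that the coarser SF never had, so the merged result is uniform in resolution.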

The updating task is conducted when the analysts want to
update the existing SF with a new piece of event log. It can
be expressed by basic operators as follows:

$\mathcal{F}_{new} = \mathrm{concatenate}\big(\mathcal{F}, \bigcup_{e_i \in \mathcal{E}_i} \mathrm{prune}^{*}(\mathrm{encode}(\mathrm{vectorize}(\mathcal{D}_i)))\big),$

where the operand of $\bigcup$ is similar to the operand of $\bigcup$ in
the storing task. First, the new set of subsequences is
vectorized and then encoded into a new SF. Then the new SF
is merged into the old SF in the same way as in the merge task.

6  Experimental  Evaluation 

We conduct a series of experiments to evaluate our proposed
framework. In this section, we do not focus on demonstrating
the meaningfulness or correctness of the summarization
results, since that is the job of the designers of the concrete
summarization algorithms. Instead, the main goal of the
evaluation is to explore the efficiency and effectiveness
of the proposed framework, and to show how META makes
summarization more flexible and convenient. More concretely,
our experiments aim to answer the following questions:
(1) What is the cost of storing the events in the form of
an SF? (2) How efficient is it to retrieve and convert the data
from the SF? (3) How effectively and flexibly can our framework
support event summarization? and (4) How do the updating and
merging tasks perform?

In addition to the evaluation of META, we also give a
case study to show how META helps analysts conduct
event summarization tasks. As a showcase, we use the
algorithm proposed in [4] as the summarization algorithm,
which summarizes the events from the perspective of
inter-arrival temporal relationships.

6.1 Storage Cost To evaluate the storage cost of the SF,
we use several real event logs across different OS platforms
and domains. These event logs are collected from
customer servers by the IBM service department, and the
details of these logs are listed in Table 2 (available at
http://share.oIidu.com/events/). These datasets differ
in time range, number of event occurrences, occurrence
frequency, number of distinct event types, and log record style.


Name             Domain       Time Units  #Types
secure-secure    Security        534,898      14
nokia-netview    Network      99,118,589      15
system-win       System       41,113,840      64
security-win     Security      5,579,292      35
application-win  Application   6,980,559      61

Table 2: Features of real datasets


Table 3 illustrates the occurrence proportion of the
events in the real-world datasets used in the experiments.
We record the maximum, average, and minimum occurrence
proportion of the event types in each dataset. Among all
the datasets, the most frequent event type has an occurrence
proportion of 0.022, indicating that the event occurs in only 22 out of
every 1000 time slots throughout the time range of the event
sequences. The data in this table demonstrates that no event
type occurs all the time (the occurrence proportion r < 1)
in real-world situations. Therefore, the second term of the
O-notation in Theorem 3.1 is comparable to the first term, which
makes the theorem meaningful.
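
For clarity, the occurrence proportion used here can be computed as in the following Python sketch. It assumes each event type is given as a per-slot count vector; the function names and the toy data are illustrative and not taken from the experiments.

```python
from typing import Dict, List


def occurrence_proportion(vector: List[int]) -> float:
    """Fraction of time slots in which the event occurs at least once."""
    return sum(1 for c in vector if c > 0) / len(vector)


def proportion_stats(vectors: Dict[str, List[int]]) -> Dict[str, float]:
    """Maximum, average and minimum proportion over all event types."""
    r = [occurrence_proportion(v) for v in vectors.values()]
    return {"max": max(r), "avg": sum(r) / len(r), "min": min(r)}


# Toy data: 1000 slots; "a" occurs in 22 of them, "b" in 2.
vectors = {
    "a": [1] * 22 + [0] * 978,
    "b": [1] * 2 + [0] * 998,
}
stats = proportion_stats(vectors)
```

With this toy data the maximum proportion is 22/1000, the same magnitude as the most frequent event type reported above.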


                 Maximum        Average        Minimum
secure-linux     0.005          0.001          3.739 x 10^-6
nokia-netview    2.185 x 10^-4  4.064 x 10^-5  1.009 x 10^-8
system-win       4.886 x 10^-4  1.413 x 10^-4  2.432 x 10^-8
security-win     0.022          9.381 x 10^-4  1.792 x 10^-7
application-win  0.003          5.787 x 10^-?  1.432 x 10^-7

Table 3: Occurrence proportion in real datasets


In order to measure the storage cost, we store the SFs
as binary files using object serialization. We
use the compression ratio (CR) to quantify the size of an
SF file relative to the original log file, i.e.,
CR = size(SF file) / size(original log file). To further save
storage space, we leverage the DEFLATE algorithm [2] to compress
the serialized SF. Figure 4 shows the compression ratio of all the datasets.
It can be observed that all the stored SFs take less storage
space than the original logs (CR < 1). Moreover,
after being compressed by DEFLATE, even the worst
compressed SF takes only 32.4% of the space of the original log
file. This shows that storing the logs as SFs saves
disk space.
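
The measurement procedure can be reproduced in spirit with the Python sketch below. It is an illustration under assumptions: `pickle` stands in for the object serialization used in the experiments, `zlib.compress` produces a DEFLATE stream wrapped in a zlib container, and the toy log format and toy SF are made up.

```python
import pickle
import zlib


def compression_ratio(stored: bytes, original: bytes) -> float:
    """CR = size of the stored form / size of the original log."""
    return len(stored) / len(original)


# Toy raw log: one text line per event occurrence (format is made up).
raw_log = "".join(
    f"2011-11-01 00:{i % 60:02d}:00 host1 EVENT_{i % 3}\n" for i in range(1000)
).encode("ascii")

# Toy stand-in for an SF: one 0/1 occurrence vector per event type.
toy_sf = {f"EVENT_{k}": [1 if i % 3 == k else 0 for i in range(1000)]
          for k in range(3)}

serialized = pickle.dumps(toy_sf)          # object serialization
deflated = zlib.compress(serialized, 9)    # DEFLATE inside a zlib container

cr_plain = compression_ratio(serialized, raw_log)
cr_deflated = compression_ratio(deflated, raw_log)
```

Both ratios are relative to the raw log, so they can be compared directly in the same way as the two bars per dataset in Figure 4.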


Figure 4: Compression ratio of SF and compressed SF

Figure 5: Compression ratio of SF in different resolution

The storage cost can be further reduced if the low-level
details are pruned. Figure 5 shows the compression ratio of
each SF at 7 different resolutions without compression. Note
that the values in the first row of the x-axis labels indicate the level
of resolution at which we store the SFs, and the values in the second
row indicate the corresponding approximate time resolution.
As depicted in this figure, when the resolution is 13 (hourly
resolution), even the most costly SF uses only 5% of the space
of the corresponding original file.

6.2 Efficiency Evaluation The efficiency evaluation is
threefold. First, we evaluate the performance
of the summarization task by exploring different permutations
of the data query operators. Second, we measure the
performance of the storing and recovering tasks to evaluate the
time overhead of conducting event summarization within our
framework. Finally, we investigate the performance of the
merging and updating tasks to evaluate the maintenance overhead.

We generate 15 synthetic datasets and investigate the
performance of our framework on different datasets by
changing the 3 properties listed in Table 4. The advantage
of using synthetic datasets is that we can systematically evaluate
the performance of our framework under different
properties. Since here we only investigate the efficiency, the
occurrences of events are randomly generated.


property  values             description
#types    20-100 step 20     The number of event types.
#events   60k-140k step 20k  The number of event occurrences.
#ts       10m-50m step 10m   The time slots in the time range.

Table 4: Properties of synthetic datasets


6.2.1 Performance of Summarization Task In this section,
we evaluate the performance of all data query operators
except describe. The reason is that the performance of
describe depends on the concrete summarization algorithm.

Similar to DML in SQL, the performance of a query
varies with different operator permutations. To investigate
how the order affects the query performance, we pick three
sets of synthetic datasets to evaluate the time cost of different
permutations of project, select and zoom. In each set, we
fix two properties and change the third one. For project,
we pick 10% of the event types from the SF. For select,
we pick 10% of the time range, and for zoom we zoom out the SF
by one resolution level. Table 5 shows the running time of all
the 6 different permutations in the 3 sets of experiments. By
examining the experiment results from different perspectives,
we obtain the following observations:

1. Different operators have different time costs. Table 5
shows that select is the most time-consuming and project
is the most time-efficient. In our experiments, select is
$10^2$ to $10^4$ times slower than zoom and project. The reason
is that, by taking advantage of the SF, zoom only needs to
remove all the leaves from the trees in $O(\log |T|)$ time, and
project only needs to remove the useless trees in $O(|\mathcal{E}|)$
time. However, select is more complicated than the other
two operators. It builds a new SF by extracting the events
satisfying the select parameters from the old SF, which takes
$O(|T| \log |T|)$ time.

2. Query performance varies drastically with different
operator orders. The experiment results show that
the fastest query takes only 3% of the time of the slowest query
on the same dataset. As mentioned before, select is the
slowest operator. The more data it processes, the slower
the execution is. Therefore, the later the select
operation is conducted, the shorter the query execution time
is.

3. Query performance is insensitive to #events. According
to the experiment results on the datasets with the
same #types and #ts (1st group), the query performance
appears to be stable when #events increases. In contrast,
the experiment results on the datasets with the same #events
and #ts (2nd group) show that the query time varies linearly
with #types. The results are similar for the datasets with the same
#types and #events but with different #ts (3rd group).

Based  on  the  above  observations,  to  avoid  unnecessary 
time  cost,  a  good  query  statement  should  postpone  the  select 
operation  as  much  as  possible.  In  our  prototype,  we  conduct 
simple  query  optimization  by  reordering  the  operators. 
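
A reordering of this kind could look like the Python sketch below. It is a hypothetical illustration, not the prototype's optimizer: it assumes the query is a list of (operator name, parameters) pairs and that the three operators can be reordered for the query at hand, which need not preserve results in general (see Section 5.1).

```python
from typing import Any, List, Tuple

Op = Tuple[str, Any]  # (operator name, parameters); names follow the paper

# Lower rank runs first. The ranking mirrors the measured costs above:
# project is cheapest, zoom is next, select is by far the slowest.
COST_RANK = {"project": 0, "zoom": 1, "select": 2}


def reorder(query: List[Op]) -> List[Op]:
    """Stable sort so that select runs last, on as little data as possible."""
    return sorted(query, key=lambda op: COST_RANK[op[0]])


query = [("select", (0, 10)), ("project", {"a"}), ("zoom", 1)]
plan = reorder(query)
```

Since `sorted` is stable, repeated operators of the same kind keep their relative order, so only the position of the expensive select changes.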

6.2.2 Framework Time Overhead The storing and recovering
tasks are not directly related to event summarization,
and they are considered overhead for summarization.
We conduct experiments on the same sets of datasets that are
used in Section 6.2.1. For each set of datasets, we investigate
the time cost of storing and recovering by measuring the
running time of the operators involved: vectorize, encode, and prune for
the storing task, and decode and unvectorize for the recovering task.

Figure 6 shows the experiment results of the time overhead,
where the first bar of each dataset indicates the overhead
of the storing task and the second bar indicates the overhead
of the recovering task. From the experiment results, we
obtain the following two observations. First, all the
experiments take tens of seconds to finish the tasks. Since these
two tasks are rarely used, the overhead is acceptable. Second,
the time overhead of these two tasks is insensitive
to #events but sensitive to #types and #ts. Drilling down
to the operator level, we find that most of the increased
running time comes from the encode operator in the storing task and
the decode operator in the recovering task. In our implementation,
both of these operators have the same time complexity
$O(|\mathcal{E}|\,|T| \log |T|)$. The running time increases if either
$|\mathcal{E}|$ or $|T|$ increases. The distribution of event
occurrences is another factor that affects the running time.

Order         select-project-zoom   select-zoom-project   project-select-zoom   project-zoom-select   zoom-project-select   zoom-select-project
Dataset       select   zoom   proj  select   zoom   proj  select   zoom   proj  select   zoom   proj  select   zoom   proj  select   zoom   proj
100-60k-50m    72.98  0.006  0.001   84.55  0.006  0.001   84.52  0.005  0.001    2.39   0.02  0.001    2.45   0.02  0.001    2.40   0.02  0.001
100-80k-50m    71.63  0.011  0.001   82.65  0.007  0.001   84.73  0.007  0.001    2.44   0.02  0.001    2.53   0.02  0.001    2.39   0.02  0.001
100-100k-50m   80.48  0.009  0.001   77.25  0.008  0.001   77.65  0.009  0.001    2.39   0.03  0.001    2.41   0.03  0.001    2.41   0.03  0.001
100-120k-50m   84.28  0.008  0.001   85.11  0.009  0.001   84.71  0.009  0.001    2.51   0.03  0.001    2.53   0.03  0.001    2.46   0.03  0.001
100-140k-50m   82.16  0.010  0.001   84.73  0.010  0.001   85.13  0.010  0.001    2.52   0.04  0.001    2.57   0.04  0.001    2.57   0.03  0.001
20-100k-50m    16.32  0.005  0.001   16.21  0.005  0.001   15.48  0.005  0.001    0.55   0.02  0.001    0.55   0.02  0.001    0.51   0.24  0.001
40-100k-50m    33.43  0.006  0.001   30.85  0.007  0.001   31.13  0.007  0.001    0.99   0.02  0.001    0.99   0.02  0.001    0.99   0.02  0.001
60-100k-50m    48.37  0.008  0.001   46.31  0.007  0.001   46.53  0.008  0.001    1.46   0.02  0.001    1.46   0.02  0.001    1.44   0.03  0.001
80-100k-50m    64.34  0.008  0.001   62.10  0.008  0.001   62.09  0.007  0.001    2.12   0.03  0.001    2.04   0.03  0.001    2.04   0.03  0.001
100-100k-50m   80.48  0.009  0.001   77.25  0.008  0.001   77.65  0.009  0.001    2.39   0.03  0.001    2.41   0.03  0.001    2.41   0.03  0.001
100-100k-10m   18.36  0.007  0.001   18.97  0.006  0.001   18.13  0.006  0.001    0.62   0.04  0.001    6.23   0.02  0.001    0.59   0.03  0.001
100-100k-20m   36.83  0.006  0.001   36.56  0.007  0.001   36.66  0.007  0.001    1.16   0.05  0.001    1.19   0.03  0.001    1.29   0.03  0.001
100-100k-30m   39.86  0.008  0.001   39.44  0.008  0.001   39.98  0.008  0.001    1.17   0.03  0.001    1.30   0.03  0.001    1.30   0.03  0.001
100-100k-40m   77.29  0.009  0.001   76.71  0.008  0.001   74.61  0.008  0.001    2.22   0.03  0.001    2.47   0.03  0.001    2.37   0.03  0.001
100-100k-50m   80.48  0.009  0.001   77.25  0.008  0.001   77.65  0.009  0.001    2.39   0.03  0.001    2.41   0.03  0.001    2.41   0.03  0.001

Table 5: Running time composition of different query orders (time unit: second)

(a) Datasets with different #events   (b) Datasets with different #types   (c) Datasets with different #ts

Figure 6: Running time of storing and recovering tasks for datasets with different #events, #types, #ts

6.2.3 Performance of SF Maintenance In this section,
we investigate the performance of the maintenance tasks from
two aspects: how the characteristics of the events and how
the resolution of the data affect the performance. Similar
to the previous experiments, we evaluate the performance of
both tasks using the same three groups of datasets. For
each group, we convert the first dataset into an SF, and
incrementally update and merge the other datasets. Moreover,
we evaluate the updating and merging tasks by storing the SFs
at 7 different resolutions. Figures 7(a) and 7(b) illustrate
the results for the merging and updating tasks. The time cost
of the merging task is sensitive to the resolution, but that of the
updating task is not. At a high resolution, the merging task is
more efficient than the updating task. The reason is that the
updating task uses the time-consuming encode operator but
the merging task does not.

(a) Updating task   (b) Merging task

Figure 7: Performance of maintenance tasks

6.3 An Illustrative Case Study To demonstrate how
META facilitates summarization, we list 3 tasks (1st row)
as well as the corresponding statements (2nd row) in Figure 8
to show how the analysts work on the security-win dataset. We
also attach the corresponding summarization results (3rd row),
obtained by implementing the describe operator according to [4]
(see footnote 1).

Task 1: Summarize the security-win log in day granularity.
  (1) store security-win as SSF with resolution 13
  (2) describe SSF zoom to resolution 16

Task 2: Drill down to view the summary with event types (538, 540,
576, 858 and 861) between 11/01/2011 and 11/29/2011 in hour
granularity.
  (3) describe (select 538, 540, 576, 858, 861 from SSF
      between 01/11/2011 and 29/11/2011 zoom to resolution 6)

Task 3: Update the summary with the new log recorded from 11/30/2011
to 01/04/2012 and then summarize them together in minute
granularity.
  (4) update SSF with new-security-log
  (5) Execute (3) again by changing resolution to 13

Figure 8: Summarize with META

As shown in Figure 8, the analysts only need to write
one or two commands for each task. All the details are
handled by the framework. Besides convenience, META
also improves the reusability of data due to the SF's natural
property. Once the security-win log is stored in the SF, it is
directly available for all 3 tasks, and there is no need
to generate or maintain any intermediate data.

1 Source code is available at http://users.cs.fiu.edu/~yjian004/#codes

Without META, the analysts need to write programs on
their own to conduct the data transformation and extraction.
Taking task 2 for instance, the analysts would have to write several
programs to transform the events into hourly resolution, to pick
out the records related to the event types 538, 540, 576, 858,
861, and to extract the records occurring between 11/01/2011
and 11/29/2011. The analysts would do similar tedious work
when facing the other two tasks.

7  Conclusions 

We present our research efforts on establishing META, an
integrated event summarization framework. To facilitate the
multi-resolution analysis of the events, we store the events in
the form of a summarization forest. We also propose a set of
atomic operations on top of the data model and a set of
summarization tasks to ease the work of the analysts. The
experimental results demonstrate the efficiency and effectiveness
of our proposed framework. In future work, we will
extend event summarization techniques to support distributed
systems, which generate events in more complicated
environments at a much larger scale. Moreover, we will use the
summarization results for automatic problem determination.

Appendix
Proof of Lemma 3.2

Proof. We calculate the number of zero nodes from the bottom
level to the top level. Besides the root level, the number of
nodes at height $h$ is $n_h = \frac{|T|}{2^{h+1}}$. For each level, the
number of zero nodes $z_h$ equals the number of nodes $n_h$ minus
the number of non-zero nodes $u_h$.

We start with $h = 0$ (the leaf level). In the worst case, the
event occurrences are uniformly distributed along the time-line.
There are two cases according to $r$:

1. $0 \le r < \frac{1}{2}$. The event occurs in less than half of the
time slots. In this case, $u_0 = \min(r|T|, n_0)$, and
$z_0 = n_0 - u_0 = \frac{|T|}{2} - r|T|$. So $p_0 = \frac{z_0}{n_0} = 1 - 2r$.

2. $\frac{1}{2} \le r \le 1$. The number of zero nodes at the leaf level can be
0. Since the occurrences are uniformly distributed, it is possible
that the event appears at least once in every two consecutive
time slots. In this case, $p_0 = 0$.

Therefore, the lower bound of the probability of zero nodes at the
leaf level is $p_0 = \max(1 - 2r, 0)$. When $h = 1$, in the
worst case, the non-zero nodes at the leaf level
are still uniformly distributed, so $u_1 = \min(u_0, n_1)$. Therefore,
$z_1 = n_1 - u_1 = \max(n_1 - u_0, 0)$ and $p_1 = \max(1 - 2^2 r, 0)$.
When $h > 1$, if the non-zero nodes at the lower
level are still uniformly distributed, the number of non-zero nodes is
$u_h = \min(u_{h-1}, n_h)$. Similar to the case of $h = 1$, $z_h = n_h - u_h$, and
$p_h = \max(1 - 2^{h+1} r, 0)$.
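
The recurrence in the proof can be checked numerically with the Python sketch below. It is only a sanity check under the proof's worst-case assumption of uniformly distributed occurrences; the function names and the chosen values of |T| and r are illustrative.

```python
def zero_node_bound(r: float, h: int) -> float:
    """Closed form from the proof: p_h = max(1 - 2^(h+1) * r, 0)."""
    return max(1 - 2 ** (h + 1) * r, 0.0)


def zero_node_fractions(num_slots: int, r: float, max_height: int):
    """Replay the recurrence u_h = min(u_{h-1}, n_h) level by level."""
    fractions = []
    u = r * num_slots                    # occupied time slots, r|T|
    for h in range(max_height + 1):
        n = num_slots / 2 ** (h + 1)     # nodes at height h
        u = min(u, n)                    # non-zero nodes at height h
        fractions.append((n - u) / n)    # p_h = z_h / n_h
    return fractions


fractions = zero_node_fractions(num_slots=1024, r=1 / 64, max_height=6)
```

For these values the fraction of zero nodes halves its distance to zero at each level until the non-zero nodes fill a whole level, after which it stays at 0.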
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